ABSTRACT. Given a fixed set $S$ of $n$ keys, we would like to store them so that queries of the
form “Is $x \in S$?” can be answered quickly. A commonly employed scheme to solve this problem
uses a table to store the keys, and a special purpose program depending on $S$ which probes the
table. We analyze the tradeoff between the maximum number of probes allowable to answer
a query, and the information-theoretic complexity of the program to do so. Perfect hashing
(where the query must be answered in one probe) has a program complexity of $n \log_2 e\,(1 + o(1))$
bits, and this lower bound can be achieved. Under a model combining perfect hashing and
binary search methods, it is shown that for $k$ probes to the table, $\frac{nk}{2^{k+1}}(1 + o(1))$ bits
are necessary and sufficient to describe a table searching algorithm. This model gives some
information-theoretic bounds on the complexity of searching an external memory. We examine
some schemes where pointers are allowed in the table, and show that for $k$ probes to the
table,  about  jjjpjffjHl  +  o(l ))  bits  are  necessary  and  sufficient  to  describe  the  search.  Finally, 

we  prove  some  lower  bounds  on  the  worst  case  performance  of  hash  functions  described  by 
bounded  Boolean  circuits,  and  worst  case  performance  of  universal  classes  of  hash  functions. 
Given a fixed set $S$ of $n$ keys, we would like to store them so that membership queries
of the form “Is $x \in S$?” can be answered quickly. This searching problem, particularly in the
static case, is clearly among the most fundamental of data structuring problems, as well as
being ubiquitous in computer science applications.


This simple information-retrieval problem has generated considerable interest in recent
years. The papers [Sprugnoli] and [Jaeschke] suggest several ad hoc hashing schemes to
implement a solution. Jaeschke recommends using a hash function of the form
$h(x) = \lfloor A/(Bx + C) \rfloor \pmod{n}$, where the constants $A$, $B$, and $C$ depend on $S$.
This function is called “perfect” as no keys in $S$ collide under $h$, and $h$ has a very
reasonable unit-cost arithmetic complexity. Its program complexity, on the other hand, is very
large, as the number of bits needed to write $A$, $B$, and $C$ is $O(n)$. By contrast, binary
search has a program complexity of $O(\log n)$ and answers queries in $O(\log n)$ probes. This
relation between search time (measured in number of probes) and program complexity (measured
in bits) suggests that there should be an information-theoretic tradeoff between the two. The
tradeoff intuitively corresponds to the relationship of the performance of a search strategy to
the inherent complexity of its description. The relationship has appeared in a variety of
disguises, including the following:


1. Searching External Memories. Gonnet and Larson recently examined the problem
of searching an external memory with limited internal storage [Gonnet and Larson]. Since
accessing external memory is very time-consuming, methods which reduce the number of
accesses are very desirable, even if they increase internal processing. Gonnet and Larson
examine external hashing techniques assuming random probing, where a small amount of
internal storage (a directory) is used to help direct the search. How does the size of the
directory affect the efficiency of the search? If membership queries are answered by retrieving
pages of the external memory, the tradeoff described above represents the relationship between
the directory size and the number of uniform-size pages needed to store $S$. We determine this
tradeoff.


2. Internal Searching in a constant number of probes. Several investigations have been
made in this direction, including [Tarjan and Yao] and more recently [Fredman et al.]. Both
of their results show practical schemes for answering queries in $O(1)$ probes with a program
complexity of $O(n \log n)$. Another paper [Yao] demonstrates a “canonical 2-probe structure”
which always probes first to a table position containing a directory for the rest of the table, and
uses information in the directory to choose the next probe. Again, the size of the directory is the
program complexity of the search strategy. We analyze the worst case of hashing with separate
chains, and determine the tradeoff between program complexity and chains of maximum length
$k$, so queries are answered in no more than $k$ probes.


3. Probabilistic Hashing and Hash Circuits. Recent work on probabilistic hashing by
Carter and Wegman suggests choosing a hash function at random from a class of functions
$\mathcal{H}$ having a “universal” property [Carter and Wegman]. The property guarantees certain
desirable input independent expected bounds on search time (measured in number of probes),
randomizing over the choice of function. The program complexity (roughly $\log_2 |\mathcal{H}|$) of these
hash functions was analyzed in [Mehlhorn]. It is of interest to minimize $\log_2 |\mathcal{H}|$ for several
reasons: note that $\log_2 |\mathcal{H}|$ is a measure of how many coin flips are needed to randomly
choose a function from $\mathcal{H}$, which becomes important in some applications. We show that
for such minimal $\mathcal{H}$, there are some worst-case input dependent lower bounds on search time
($\Omega(\log n / \log\log n)$ probes). Similar lower bounds can be derived for bounded-size Boolean hash circuits.

Program complexity (see [Chaitin]) is a measure which has largely been applied to
problems in formal systems, “machine based” recursive function theory, and information
theory. We intend to use it as a tool in concrete complexity, applying it to a particular
information retrieval problem. As long as we assume a basic random access machine (RAM) model,
this measure of program complexity is fundamentally independent of language implementation,
so that it does not matter whether a search strategy to answer membership queries is described
in assembly language, PASCAL, or French.



§2  Model  and  Notation 

The problem is formalized as follows. We construct a special purpose computer
program $P$ depending on $S$ and store the elements of $S$ in a table, possibly with some associated
pointer structure. To answer “Is $x \in S$?” $P$ is allowed to probe the table in its search for $x$,
and make auxiliary computations (following pointers if they are permitted) between probes. If
$P$ finds $x$ in the table, it answers “yes.” If it does not find $x$, but gathers sufficient information
to certify that $x$ is not in the table, it answers “no.” In addition, we use the following notation:

$M = \{0, 1, \ldots, m - 1\}$. The key space.

$M^{(n)}$. The subsets of $M$ of cardinality $n$.

$S \in M^{(n)}$. The set in question. We assume that $M$ is much larger than $S$, which is reasonable
in most applications.

$N = \{0, 1, \ldots, n - 1\}$. The address space of the table.

$\mathcal{P}^{(n)}$. The partitions of $M$ into $n$ parts (possibly empty parts).

$P \in \mathcal{P}^{(n)}$. We use $P$ ambiguously to denote both partitions of $M$ as well as program encodings of
search strategies, because there is a direct and significant correspondence between the two.
Which meaning of $P$ is intended should be clear from context.


§3  Hashing  and  Partitions:  Combinatorial  Preliminaries 

Every hash function program $P : M \to N$ induces a partition of the key space $M$
into parts $M_i$, where $\bigcup_i M_i = M$ and $M_i = \{x \in M \mid P(x) = i\}$. We can examine a partition
property $A(P, S)$ of a program (partition) $P$ and a set $S$, and determine whether or not the
property is satisfied. For example, “no two elements of $S$ are found in the same part of the
partition defined by $P$” (the perfect hashing property), or “no more than $k$ elements of $S$ are
found in the same part of the partition defined by $P$.” Given any set $S \in M^{(n)}$, we want
a program $P$ such that $A(P, S)$ holds. Define $C(A)$ as the bit complexity of the program
(partition) $P$ satisfying $A(P, S)$.
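The correspondence between a hash function and a partition can be made concrete with a small sketch. The helper names, the key space size, and the two example hash functions below are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

def induced_partition(hash_fn, m, n):
    """Return the parts M_i = {x in {0..m-1} : hash_fn(x) == i} for i in 0..n-1."""
    parts = [[] for _ in range(n)]
    for x in range(m):
        parts[hash_fn(x)].append(x)
    return parts

def max_keys_per_part(hash_fn, keys):
    """Largest number of keys of S that land in one part of the partition."""
    counts = defaultdict(int)
    for x in keys:
        counts[hash_fn(x)] += 1
    return max(counts.values())

def at_most_k_per_part(hash_fn, keys, k):
    """Property A(P, S): no part holds more than k keys of S (k = 1 is perfect hashing)."""
    return max_keys_per_part(hash_fn, keys) <= k

# Hypothetical example: m = 12 keys, n = 4 table addresses.
m, n = 12, 4
S = [1, 2, 7, 8]
mod_hash = lambda x: x % n      # sends S to addresses 1, 2, 3, 0
div_hash = lambda x: x // 3     # sends S to addresses 0, 0, 2, 2
```

With these choices `mod_hash` satisfies the perfect hashing property for `S`, while `div_hash` only satisfies the weaker property with `k = 2`.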

We  can  then  generate  lower  bounds  through  the  following  counting  argument: 
Theorem 1. Let

$$Q_A = \max_{P \in \mathcal{P}^{(n)}} \left|\{S \in M^{(n)} \mid A(P, S) \text{ holds}\}\right|.$$

Then for any $S \in M^{(n)}$, the bit complexity of a program satisfying $A(P, S)$ is bounded below
by

$$C(A) \ge \log_2 \left[ \binom{m}{n} \Big/ Q_A \right].$$
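For very small parameters the quantity in Theorem 1 can be computed by exhaustive search. The sketch below does this for the perfect hashing property; the function names and the choice $m = 6$, $n = 3$ are assumptions made only for illustration.

```python
from itertools import combinations, product
from math import comb, log2

def perfect(P, S):
    """Perfect hashing property: P sends the keys of S to distinct parts."""
    return len({P[x] for x in S}) == len(S)

def q_max(m, n, prop):
    """Brute-force Q_A: the most n-subsets any single partition can satisfy."""
    best = 0
    for P in product(range(n), repeat=m):          # P[x] = part containing key x
        count = sum(1 for S in combinations(range(m), n) if prop(P, S))
        best = max(best, count)
    return best

m, n = 6, 3
Q = q_max(m, n, perfect)
lower_bound_bits = log2(comb(m, n) / Q)
```

Here the maximum is reached by a partition into three parts of two keys each, so that `Q` is $2 \cdot 2 \cdot 2$ and the bound is $\log_2(20/8)$ bits.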

The  following  probabilistic  argument  lets  us  construct  upper  bounds. 

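Theorem 2. Let $A(P, S)$ be a partition property over the class $\mathcal{P}^{(n)}$ of partitions, where

$$\Pr\{A(P, S) \text{ holds}\} \ge p(m, n).$$

Then there exists a set $\mathcal{H} = \{P_1, P_2, \ldots, P_k\} \subset \mathcal{P}^{(n)}$ of $k$ partitions where $k = \frac{n \ln m}{p}$, and the
following is true: for any $S \in M^{(n)}$ there exists a $P \in \mathcal{H}$ such that $A(P, S)$ holds.

Proof. Suppose we choose the partitions in $\mathcal{H}$ independently at random. Let $A_S$, $S \in M^{(n)}$,
be the event

$$\bigwedge_{P \in \mathcal{H}} (A(P, S) \text{ does not hold}).$$

Then

$$\Pr\{A_S\} \le (1 - p)^k,$$

and

$$\Pr\Big\{\bigvee_{S \in M^{(n)}} A_S\Big\} \le \sum_{S \in M^{(n)}} \Pr\{A_S\} \le \binom{m}{n}(1 - p)^k.$$

If

$$\binom{m}{n}(1 - p)^k < 1, \qquad (*)$$

then it is possible to deterministically choose an $\mathcal{H}$ making $\bigvee_{S \in M^{(n)}} A_S$ false, and

$$\neg \bigvee_{S \in M^{(n)}} A_S \iff \bigwedge_{S \in M^{(n)}} (\exists P \in \mathcal{H})(A(P, S) \text{ holds}),$$

which would prove the theorem. Since $kp = n \ln m$, we know

$$k \ln(1 - p) + n \ln m < -kp - \frac{kp^2}{2} + n \ln m = -\frac{kp^2}{2} < 0,$$

or $m^n (1 - p)^k < 1$, which implies (*), proving the theorem. ∎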
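The existence argument can be tried out on a toy instance by sampling. This is only a sketch under stated assumptions: the perfect hashing property is used, the parameters $m = 8$, $n = 3$ and the seed are arbitrary, and the loop resamples until a covering family appears rather than derandomizing the choice.

```python
import random
from itertools import combinations
from math import ceil, factorial, log

def perfect(P, S):
    """Perfect hashing property: P sends the keys of S to distinct parts."""
    return len({P[x] for x in S}) == len(S)

def covering_family(m, n, seed=0):
    """Sample k = ceil(n ln m / p) random partitions until every n-subset is covered."""
    rng = random.Random(seed)
    p = factorial(n) / n ** n                 # Pr{random P is perfect for a fixed S}
    k = ceil(n * log(m) / p)
    all_sets = list(combinations(range(m), n))
    while True:                               # each attempt succeeds with positive probability
        family = [[rng.randrange(n) for _ in range(m)] for _ in range(k)]
        if all(any(perfect(P, S) for P in family) for S in all_sets):
            return family

family = covering_family(8, 3)
```

For these parameters the family has 29 partitions, and by construction every 3-subset of the 8 keys is perfectly hashed by at least one of them.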
We shall see that if $P$ is chosen randomly from $\mathcal{P}^{(n)}$, and $\Pr\{A(P, S) \text{ holds}\} \ge p(m, n)$,
then Theorem 2 will let us show upper bounds of $C(A) \le \log_2 \frac{1}{p(m,n)} + O(\log n + \log\log m)$.


§4  Some  Applications 


We first examine perfect hashing, where $P$ may probe exactly one table location in
its search to answer the query, “Is $x \in S$?” If $x$ appears there, $P$ answers “yes”; if $x$ does not
appear, $P$ must be justified in its saying “no” without fear of the appearance of $x$ somewhere
else in the table. We would like to know exactly how many bits are needed to define such a
perfect hash function for every $S \in M^{(n)}$, denoted $\mathrm{PHF}(m, n)$.


Theorem 3.

$$\mathrm{PHF}(m, n) \ge n \log_2 e - \frac{1}{2} \log_2 2\pi n + O(\cdots).$$

Proof. We use the counting argument of Theorem 1. Let $P$ be a program which is a perfect
hash function. Think of $P$ as a function $P : M \to N$ which maps each key in the key space onto
a particular address in the table. $P$ then induces a partition of $M$ into parts $M_0, M_1, \ldots, M_{n-1}$,
where $\bigcup_i M_i = M$ and $M_i = \{x \in M \mid P(x) = i\}$. If $P$ is a perfect hash function for $S$, clearly
no two keys in $S$ can be in the same $M_i$, since $P$ restricted to $S$ gives the different addresses
of the elements of $S$ in the table. Define

$$\mathrm{Perf}(P) = \left|\{S \in M^{(n)} \mid P \text{ is a perfect hash function for } S\}\right|.$$

Let $\mathcal{H}$ be a set of search programs such that for any $S \in M^{(n)}$, there exists a $P \in \mathcal{H}$ which is
perfect for $S$. Then by counting,

$$\Big(\max_P \mathrm{Perf}(P)\Big)\, |\mathcal{H}| \ge \binom{m}{n} = \text{number of sets of size } n.$$

If $P$ partitions $M$ into parts $M_0, M_1, \ldots, M_{n-1}$ then clearly $\mathrm{Perf}(P) = \prod_{0 \le i \le n-1} |M_i|$, and
this quantity is maximized when each $M_i$ is of equal size, so that $|M_i| = m/n$. Then

$$\left(\frac{m}{n}\right)^n |\mathcal{H}| \ge \binom{m}{n},$$

or

$$|\mathcal{H}| \ge \binom{m}{n} \Big/ \left(\frac{m}{n}\right)^n.$$

Applying Stirling’s formula, taking logarithms, and assuming $n^2 = o(m)$, we obtain the desired
lower bound: clearly $\log_2 |\mathcal{H}|$ bits are necessary or else we cannot uniquely identify all the
programs in $\mathcal{H}$. ∎
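The Stirling step can be checked numerically. The sketch below compares the exact counting bound $\log_2\big[\binom{m}{n} / (m/n)^n\big]$ with the leading terms $n \log_2 e - \frac{1}{2}\log_2 2\pi n$; the sample values of $m$ and $n$ are assumptions chosen so that $n$ divides $m$ and $n^2$ is small compared with $m$.

```python
from math import comb, e, log2, pi

def counting_bound_bits(m, n):
    """Exact value of log2( C(m, n) / (m/n)^n ), assuming n divides m."""
    return log2(comb(m, n)) - n * log2(m // n)

def stirling_estimate_bits(n):
    """Leading terms of the bound after applying Stirling's formula."""
    return n * log2(e) - 0.5 * log2(2 * pi * n)

m, n = 1_000_000, 50
exact = counting_bound_bits(m, n)
approx = stirling_estimate_bits(n)
```

For these values the two quantities agree to within about a hundredth of a bit, which reflects the small size of $n^2/m$ and $1/n$.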
The counting argument of Theorem 3 shows that sets $S \in M^{(n)}$ exist with perfect
hash functions that have complexity of at least $n \log_2 e\,(1 + o(1))$. Of course, some sets, such
as $\{0, 1, 2, \ldots, n - 1\}$, have perfect hash functions of low complexity. However, the counting
argument, as in Chaitin-Kolmogorov complexity, is quite strong in the following sense: no more
than $2^{-k}\binom{m}{n}$ of the sets $S \in M^{(n)}$ can have perfect hash functions with program complexity
less than $\mathrm{PHF}(m, n) - k$. Therefore, both in Theorem 3 and in counting arguments we will
see later, a lower bound of $C$ on program complexity for perfect hashing, or analogous search
properties, indicates that at least half of the sets $S \in M^{(n)}$ have program complexity of at
least $C - 1$, etc.

Theorem 4.

$$\mathrm{PHF}(m, n) \le n \log_2 e + \frac{3}{2} \log_2 n + 2 \log_2 \log_2 m + O(1).$$

Proof. Let $A(P, S)$ be the perfect hashing property. We know that

$$\Pr\{A(P, S) \text{ holds}\} = \frac{n!}{n^n}.$$

By Theorem 2, we know it is possible to choose a set $\mathcal{H}$ of partitions so that every $S \in M^{(n)}$
has a perfect hash function chosen from $\mathcal{H}$, where

$$|\mathcal{H}| = \frac{n^n}{n!}\, n \ln m.$$

To prove the theorem, it only remains to show that a program computing a partition in $\mathcal{H}$ can
indeed be written in about $\log_2 |\mathcal{H}|$ bits. This can be done using a variant of the algorithm
suggested in [Mehlhorn], which we now describe. The idea of the program is that every class of
hash functions can be specified by an $m \times |\mathcal{H}|$ matrix $M$, where $M_{i,j} = h_j(i)$, that is, the $j$-th
hash function in $\mathcal{H}$ applied to $i$. It turns out that very short programs can enumerate these
matrices, and check them for properties like perfect hashing.

PROGRAM Perfect_S(x):

1. $b \leftarrow \lceil \log_2 m \rceil$ written in binary;

2. $j \leftarrow$ some number between 1 and $|\mathcal{H}|$ depending on $S$;

3. Search through all $2^b \times 1, 2^b \times 2, 2^b \times 3, \ldots$ matrices in some lexicographic order
with entries in $\{0, \ldots, n - 1\}$ until a “perfect matrix” $M$ is found. Column $j$ of $M$ represents
the perfect hash function for $S$: probe address $M_{x,j}$ of the table. If $x$ appears there, return
“yes,” otherwise return “no.”

The length of the above program is $\log_2 \log_2 m$ for step 1, at most $\lceil \log_2 |\mathcal{H}| \rceil$ for step
2, and $\log_2 n + O(1)$ for step 3, which proves the theorem. Note the same perfect matrix is
always found, since the matrices are enumerated in the same lexicographic order: this is why
$j$ can in principle be determined beforehand. ∎
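The enumeration in step 3 can be imitated directly for a toy key space. The following is a sketch under simplifying assumptions: the matrix has $m$ rows rather than $2^b$, columns are indexed from 0, and the function names are hypothetical. It is only feasible for tiny $m$ and $n$, since the search space grows as $n^{m \cdot \text{columns}}$.

```python
from itertools import combinations, count, product

def is_perfect_matrix(matrix, m, n, cols):
    """True if every n-subset of keys is mapped injectively by some column."""
    for S in combinations(range(m), n):
        if not any(len({matrix[x][j] for x in S}) == n for j in range(cols)):
            return False
    return True

def first_perfect_matrix(m, n):
    """Enumerate m x 1, m x 2, ... matrices in a fixed order; return the first perfect one."""
    for cols in count(1):
        for flat in product(range(n), repeat=m * cols):
            matrix = [flat[x * cols:(x + 1) * cols] for x in range(m)]
            if is_perfect_matrix(matrix, m, n, cols):
                return matrix, cols

def build(S, m, n):
    """Pick the column j that is perfect for S and lay the keys out in the table."""
    matrix, cols = first_perfect_matrix(m, n)
    j = next(j for j in range(cols) if len({matrix[x][j] for x in S}) == n)
    table = [None] * n
    for x in S:
        table[matrix[x][j]] = x
    return matrix, j, table

def member(x, matrix, j, table):
    """Answer the query with a single probe at address matrix[x][j]."""
    return table[matrix[x][j]] == x
```

Because the enumeration order is fixed, the same matrix is found on every run, so only the column index `j` depends on `S`, which mirrors the accounting of bits in the proof.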

Theorems 3 and 4 give an $\Omega(n + \log_2 \log_2 m)$ lower bound on the program complexity
of perfect hash functions. Whether $n$ or $\log_2 \log_2 m$ is the significant term is dependent on
the relative asymptotic growth of $n$ and $m$. We will assume throughout our discussion that $m$
asymptotically grows faster than $n$, but not too fast, so that the $n$ dominates the asymptotics.
More specifically, we will assume $n^2 = o(m)$ and $\log_2 \log_2 m = o(n)$. These bounds allow for
analysis of “intermediate” values of $m$ and $n$ such as $m = 2^{\sqrt{n}}$, suggested as an open problem
in [Yao].

The lower and upper bounds described by Theorems 3 and 4 have separately and independently
appeared in various places, including [Mehlhorn], [Berman et al.], and [Fredman et al.].

Now let’s look at searching an external memory. Suppose a page of external memory
stores exactly $k$ keys. Decompose $S$ into $n/k$ pages $B_i$, $1 \le i \le n/k$. Membership queries
will now be answered by examining a directory in the internal memory containing enough
information to determine in which page $x$ is found if $x$ is indeed in $S$. (If the keys in each $B_i$
are sorted, the relevant page can be pulled and binary searched in $\lceil \log_2(k + 1) \rceil$ probes.) Let
$HF_k(m, n, t)$ denote the bit complexity of the most concise such directory, where $t$ is the size
of the table.

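Theorem 5.

$$HF_k(m, n, n/k) \ge \frac{n}{2k} \log_2 2\pi k - \frac{1}{2} \log_2 2\pi n + O(\cdots).$$

The above theorem has a corresponding upper bound:

Theorem 6.

$$HF_k(m, n, n/k) \le \frac{n}{2k} \log_2 2\pi k + \log_2 n + 2 \log_2 \log_2 m - \frac{1}{2} \log_2 2\pi n + O(\cdots).$$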

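Proof. We proceed as in Theorem 4. Let $A(P, S)$ be the property “$P$ partitions $S$ into $n/k$
parts with exactly $k$ elements of $S$ in each part.” Then for fixed $S$ and randomly chosen $P$,

$$\Pr\{A(P, S) \text{ holds}\} = \frac{n!}{(k!)^{n/k}} \left(\frac{k}{n}\right)^n.$$

By Theorem 2, we can choose a set $\mathcal{H}$ of partitions so that every $S \in M^{(n)}$ has a partition
$P \in \mathcal{H}$ satisfying $A$, where

$$|\mathcal{H}| = \frac{n \ln m}{\Pr\{A(P, S) \text{ holds}\}}.$$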
The following program will satisfy $A$ for $S$ and takes $\log_2 |\mathcal{H}| + \log_2 n + \log_2 \log_2 m + O(1)$ bits
to encode.

PROGRAM AlmostPerfect_S(x):

1. $b \leftarrow \lceil \log_2 m \rceil$ written in binary;

2. $j \leftarrow$ some number between 1 and $|\mathcal{H}|$ depending on $S$;

3. Search through all $2^b \times 1, 2^b \times 2, 2^b \times 3, \ldots$ matrices in some lexicographic order
with entries in $\{0, 1, \ldots, n/k - 1\}$ until a matrix $M$ is found satisfying property $A$. $M_{x,j}$ gives
the block address $B_x$ of $x$.

4. Binary search block $B_x$ in $\log_2(k + 1)$ probes, and return “yes” or “no” depending
on whether $x$ is found in the block. ∎
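The last two steps of the program, directory lookup followed by binary search inside one page, can be sketched as follows. The directory is modeled by an arbitrary function `page_of`, and the example set and directory function are hypothetical; the sketch does not model how the directory itself is encoded.

```python
from bisect import bisect_left

def build_pages(S, k, page_of):
    """Distribute S over n/k pages of exactly k sorted keys, as directed by page_of."""
    pages = [[] for _ in range(len(S) // k)]
    for x in sorted(S):
        pages[page_of(x)].append(x)
    if any(len(page) != k for page in pages):
        raise ValueError("page_of does not split S into pages of exactly k keys")
    return pages

def member(x, pages, page_of):
    """Fetch the single page named by the directory and binary search it."""
    page = pages[page_of(x)]
    i = bisect_left(page, x)
    return i < len(page) and page[i] == x

# Hypothetical example: n = 6 keys from a key space of size 24, pages of k = 3 keys.
S = [3, 5, 10, 12, 17, 22]
page_of = lambda x: x % 2
pages = build_pages(S, 3, page_of)
```

Only one page is ever retrieved per query, so the cost in the external memory model is a single page access plus the binary search within it.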

By trading off binary search and perfect hashing, then, $\Theta(nt/2^t)$ bits are necessary
and sufficient to encode a program $P$ answering membership queries in $t$ probes to the table.


§5  Hashing  with  Separate  Chains 

Now suppose each table slot is allowed to hold a pointer to another table address as
well as holding a key. We would like to know how this additional information in the table can
be used to optimize the worst-case number of probes to the table.

Let each search program $P$ initially probe one table location to answer the query
“Is $x \in S$?” If $x$ is found at that location, $P$ answers “yes.” Otherwise it follows a chain of
pointers until $x$ is found on the chain, or the end of the chain is reached, answering “yes” or
“no” accordingly. This scheme is intended to model the static case of hashing with separate
chains. The static nature of the problem allows the folding of chains into the table, so no
additional memory is needed.
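One way to picture chains folded into the table is the sketch below, which is an illustrative layout rather than the paper's construction: each slot holds a key and a pointer, chain heads sit at their home address, and the remaining keys of each chain are placed in otherwise unused slots. The hash function and key set are hypothetical.

```python
def build_chained_table(S, n, h):
    """Fold separate chains into an n-slot table of [key, next_slot] entries."""
    chains = {}
    for x in S:
        chains.setdefault(h(x), []).append(x)
    table = [None] * n
    for home, keys in chains.items():          # chain heads occupy their home slots
        table[home] = [keys[0], None]
    free = [i for i in range(n) if table[i] is None]
    for home, keys in chains.items():          # the rest of each chain uses free slots
        prev = home
        for x in keys[1:]:
            slot = free.pop()
            table[slot] = [x, None]
            table[prev][1] = slot
            prev = slot
    return table

def member(x, table, h):
    """Probe the home slot, then follow pointers; return (answer, number of probes)."""
    slot, probes = h(x), 0
    while slot is not None and table[slot] is not None:
        probes += 1
        key, nxt = table[slot]
        if key == x:
            return True, probes
        slot = nxt
    return False, probes

# Hypothetical example: n = 6 slots, longest chain has 3 keys.
n = 6
S = [4, 9, 10, 15, 16, 23]
h = lambda x: x % n
table = build_chained_table(S, n, h)
```

In this example no query needs more than 3 probes, which is the length of the longest chain.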

We  assume  that  static  table  schemes  always  consist  of  separate  chains  of  pointers, 
where  the  program  initially  probes  to  the  first  key  in  the  chain,  and  then  follows  pointers. 
This  assumption  is  not  restrictive  in  terms  of  finding  optimal  search  programs  and  associated 
pointer  structures. 

Answering a query in $k$ probes means that no chain in the table can be more
than $k$ keys long. Each program and table structure for $S \in M^{(n)}$ now corresponds to a
partition of $M$ in which no more than $k$ keys in $S$ appear in any part. The analysis of this
model is considerably more difficult than the pointer-free models we have already examined,
as the counting problems involved are much more complicated, and we must asymptotically
approximate their solution.

Let $H_k(m, n)$ denote the number of bits required to define a search strategy as
described earlier, where no chain of pointers in the table has length greater than $k$.

Theorem  7. 
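Proof (sketch). Let $A(P, S)$ be the property “$P$ partitions $M$ into $n$ parts with at most 2
elements of $S$ found in any part.” If $P$ partitions $M$ into parts of size $p_i$, $1 \le i \le n$, then

$$Q_P = \left|\{S \in M^{(n)} \mid A(P, S) \text{ holds}\}\right|$$

and it is not difficult to show that

$$Q_A = \max_P \left|\{S \in M^{(n)} \mid A(P, S) \text{ holds}\}\right| \le \left(\frac{m}{n}\right)^n \langle z^n \rangle G(z)^n = \left(\frac{m}{n}\right)^n G(1)^n \langle z^n \rangle g(z)^n,$$

with $G(z) = 1 + z + z^2/2$ and $g(z) = G(z)/G(1)$, where $\langle z^n \rangle p(z)^n$ denotes the coefficient
of $z^n$ in $p(z)^n$. The function $g(z)$ is a probability mass
function, so that the coefficients of $g(z)^n$ represent the distribution of sums resulting from
$n$ trials of the random variable described by $g(z)$. The coefficient of interest may then be
recovered by use of the Local Limit Theorem for lattice distributions, a discrete and local form
of the Central Limit Theorem (see [Feller], [Petrov]):

$$\langle z^{\mu n + r} \rangle g(z)^n = \frac{1}{\sigma\sqrt{2\pi n}} \exp\left(\frac{-r^2}{2\sigma^2 n}\right) + O(n^{-1+3\epsilon}),$$

where $\mu$ and $\sigma^2$ are the mean and variance of $g(z)$, and $\epsilon$ is some small constant greater than
zero. Since $\mu \ne 1$ and $\mu$ and $\sigma^2$ are both constant, the above approximation is useless except
very close to the mean (i.e., for small $r$); otherwise the $O(n^{-1+3\epsilon})$ term swamps everything.
However, this situation can be remedied by the technique of shifting the mean, described in
[Greene and Knuth]. We introduce a parameter $\alpha$, and note

$$\langle z^n \rangle g(z)^n = \frac{g(\alpha)^n}{\alpha^n} \langle z^n \rangle \tilde{g}(z)^n, \qquad \tilde{g}(z) = \frac{g(\alpha z)}{g(\alpha)}.$$

If we let $\alpha = \sqrt{2}$, then $\mathrm{mean}(\tilde{g}) = 1$, and the Local Limit Theorem will provide the required
asymptotic information, since we will be asking about the distribution precisely at the mean,
in which case $r = 0$ and

$$\langle z^n \rangle \tilde{g}(z)^n = \frac{1}{\tilde{\sigma}\sqrt{2\pi n}} + O(n^{-1+3\epsilon}),$$

where $\tilde{\sigma}^2$ is the variance of $\tilde{g}(z)$.
Inserting this value into the inequality $Q_A |\mathcal{H}| \ge \binom{m}{n}$ gives the lower bound. ∎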


A tight upper bound can be proven using the nonconstructive argument, which we
give without proof:

Theorem 8.

$$H_2(m, n) \le n \log_2 \left[ \frac{e}{1 + \sqrt{2}} \right] + 2 \log_2 n + \log_2 \log_2 m + O(1).$$
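The constant $1 + \sqrt{2}$ comes from the coefficient $\langle z^n \rangle (1 + z + z^2/2)^n$, which is $(n^n/n!)$ times the probability that $n$ balls thrown into $n$ boxes leave at most 2 balls in any box. The sketch below compares the exact coefficient with the estimate obtained by shifting the mean with $\alpha = \sqrt{2}$; the variance value $2/(2 + \sqrt{2})$ of the shifted distribution and the choice $n = 200$ are worked out here for illustration and are not taken from the text.

```python
from fractions import Fraction
from math import factorial, pi, sqrt

def coeff_zn(n):
    """Exact coefficient of z^n in (1 + z + z^2/2)^n."""
    total = Fraction(0)
    for j in range(n // 2 + 1):
        # j factors contribute z^2/2, j factors contribute 1, n - 2j contribute z
        ways = factorial(n) // (factorial(j) ** 2 * factorial(n - 2 * j))
        total += Fraction(ways, 2 ** j)
    return total

def shifted_mean_estimate(n):
    """(1 + sqrt 2)^n / sqrt(2 pi n sigma^2), with sigma^2 the shifted variance."""
    sigma2 = 2 / (2 + sqrt(2))
    return (1 + sqrt(2)) ** n / sqrt(2 * pi * n * sigma2)

n = 200
ratio = float(coeff_zn(n)) / shifted_mean_estimate(n)
```

For $n = 200$ the ratio is already within a couple of percent of 1, consistent with evaluating the Local Limit Theorem exactly at the shifted mean.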

Probability, information, and combinatorial theory share a variety of asymptotic
counting techniques. To prove Theorems 7 and 8, we use the Local Limit Theorem for counting,
and not because of its relation to anything probabilistic. Given a combinatorial generating
function $f(z) = \sum_n f_n z^n$ which is free of singularities, the Residue Theorem from complex
variable theory can be used to determine particular coefficients: $\langle z^k \rangle f(z) = \frac{1}{2\pi i} \oint \frac{f(z)}{z^{k+1}}\,dz$. One
proof of the Local Limit Theorem uses precisely this technique: we assume $f(z) = g(z)^n$ where
$g(z)$ is a probability mass function, and the path of integration is the unit circle on the complex
plane. Intuitively, the unit circle path tends to “focus” the value of the integral at the mean
of $g(z)^n$. The method of shifting the mean is a heuristic which allows the saddle point of the
integrand to be “moved” to an advantageous place on the path.

We now generalize the above methods to prove a tradeoff for arbitrary $k$.

Theorem  9. 


Ih{m,  n)  = 


r»log2e 


^2  ^j  +  ^^^  +  ^CoS^  +  'oglogm). 


Proof (sketch). Let $A(P, S)$ be the property “$P$ partitions $M$ into $n$ parts with no more than
$k$ elements of $S$ found in any part.” The quantity $Q_A$ defined in Theorem 1 (and needed to
determine a lower bound) can be bounded as

$$Q_A \le \left(\frac{m}{n}\right)^n \langle z^n \rangle \left(1 + z + \frac{z^2}{2!} + \cdots + \frac{z^k}{k!}\right)^n.$$

In showing an upper bound, we find that for randomly chosen $P \in \mathcal{P}^{(n)}$,

$$\Pr\{A(P, S) \text{ holds}\} = \frac{n!}{n^n} \langle z^n \rangle \left(1 + z + \frac{z^2}{2!} + \cdots + \frac{z^k}{k!}\right)^n.$$

Surprisingly, the same generating function, given different combinatorial interpretations,
appears in both the upper and lower bounds. The appearance of the above generating function
in the lower bound is related to the fact that $Q_A$ is maximized by partitions of the key space
into equal sized parts. This equipartition property, which gives a bound on the entropy of
the search program, is analogous to the equipartition property of information theory which
maximizes entropy in source coding.
To recover the coefficient of interest in the above truncated exponential, denoted as
$G(z)$, we proceed as follows:

• $\langle z^n \rangle G(z)^n$ is replaced by $G(1)^n \langle z^n \rangle g(z)^n$, where $g(z) = G(z)/G(1)$ is now a probability
mass function. $\langle z^n \rangle g(z)^n$ is then rewritten as $\langle z^n \rangle g(z)^n = \frac{g(\alpha)^n}{\alpha^n} \langle z^n \rangle \tilde{g}(z)^n$ where
$\tilde{g}(z) = g(\alpha z)/g(\alpha)$.

• By the Residue Theorem, $\langle z^n \rangle \tilde{g}(z)^n = \frac{1}{2\pi i} \oint \frac{\tilde{g}(z)^n}{z^{n+1}}\,dz$. We select as a contour of integration
the path $z = e^{it}$ around the pole at $z = 0$. We now choose $\alpha$ so that $\alpha g'(\alpha) = g(\alpha)$, which
shifts the mean of $\tilde{g}$ to 1, and thus the saddle point of the contour integral to $z = 1$. To
choose $\alpha$ satisfying these constraints, we must find the positive real zero of a polynomial of
degree $k$, which (unless Galois was wrong) cannot in general be done. In this case, though,
we can compute an asymptotic approximation for the solution of $\alpha = 1 + \cdots$.

• Since $k$ may depend on $n$, the Local Limit Theorem cannot be used as in Theorem 7.
Instead, we use the saddle-point method of complex variables [deBruijn]. We show the
existence of a neighborhood around $z = 1$ where $\ln \tilde{g}(z)$ converges, so that $\exp(\ln \tilde{g}(z))$ can
be expanded in a convergent power series, and derive

$$\langle z^n \rangle \tilde{g}(z)^n = \frac{1}{2\pi} \int_{-\delta}^{\delta} \exp\left((\mu - 1)\,itn - \frac{\sigma^2 t^2 n}{2} + \cdots\right) dt + O(\rho^n),$$

where $\delta > 0$ is a small constant, $0 < \rho < 1$ is a constant, and $(\mu, \sigma^2, \kappa_3, \ldots)$ are the
semi-invariants of $\tilde{g}(z)$. Since $\mu = \tilde{g}'(1) = 1$, the first term $(\mu - 1)\,itn$ is zero. The proof is
completed by use of Laplace’s method for integrals around the saddle point. ∎
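Although the shifting parameter has no closed form for general $k$, it is easy to approximate numerically. The sketch below solves $\alpha G'(\alpha) = G(\alpha)$ for the truncated exponential $G$ of degree $k$ by bisection (the normalization by $G(1)$ cancels, so the same $\alpha$ works for $g$). It assumes $k \ge 2$, where the root lies between 1 and 2; the function names and tolerance are illustrative choices.

```python
from math import factorial, sqrt

def truncated_exp(z, k):
    """G(z) = 1 + z + z^2/2! + ... + z^k/k!."""
    return sum(z ** j / factorial(j) for j in range(k + 1))

def shifted_mean_alpha(k, tol=1e-12):
    """Bisection for the root in (1, 2] of a*G'(a) = G(a), valid for k >= 2."""
    # The derivative of the degree-k truncated exponential is the degree-(k-1) one.
    f = lambda a: a * truncated_exp(a, k - 1) - truncated_exp(a, k)
    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha_2 = shifted_mean_alpha(2)
alpha_3 = shifted_mean_alpha(3)
alpha_5 = shifted_mean_alpha(5)
```

For $k = 2$ this recovers $\alpha = \sqrt{2}$, the value used for chains of length at most 2, and the root moves toward 1 as $k$ grows.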

We  note  that  similar  asymptotic  analysis  has  been  used  by  Philippe  Flajolet  to  analyze 
the  expected  behavior  of  extendible  hashing  and  trie  searching  [Flajolet]. 

Corollary 10. If $k = O(1)$, so that we insist on answering queries in a constant number of
probes, then $H_k(m, n) = \Omega(n)$.

Theorem 9 demonstrates that when $n \approx (k + 1)!$, or equivalently $k \approx \ln n / \ln\ln n$, the size
of $\mathcal{H}$ is about a constant. Then for a fixed set $S \in M^{(n)}$ and randomly chosen $P \in \mathcal{P}^{(n)}$, the
probability that no more than $\ln n / \ln\ln n$ elements of $S$ are found in any part in $P$ is large. This
fact is not altogether surprising, since it is closely related to the following classical problem
in random allocations: when throwing $n$ balls at random into $n$ boxes, what is the expected
value of the maximum number of balls in any box? [Kolchin et al.], [Diaconis and Freedman].
It turns out that the expected value is about $\ln n / \ln\ln n$. In terms of hashing with separate chains,
this statistic can be interpreted as the expected length of the longest probe sequence, which
has been closely analyzed in [Gonnet].


§6  The  Effect  of  Table  Expansion 

We have thus far analyzed two kinds of tradeoffs. In Theorem 6, a tradeoff was
effected by synthesizing binary search and perfect hashing. Using this “paged hashing” scheme,
we can answer membership queries in $t$ probes with a program complexity of $\frac{nt}{2^{t+1}}(1 + o(1))$
bits. Another tradeoff scheme was analyzed in Theorem 9, where we considered hashing with
separate chains of length at most $t$, and a consequent maximum search time of $t$ probes. This
“chained  hashing”  method  has  a  program  complexity  of  +  o(l))  bits,  which  is  smaller 

than the complexity of paged hashing, though $O(n \log n)$ bits are needed to store pointers for
the chains.

What differences between these schemes could cause their differing program
complexities? The paged hashing method has an address space of $n/k$, with exactly $k$ keys stored
at each address (page). The chained hashing method has an address space of $n$, with at most
$k$ keys stored at any address.

Suppose we modified paged hashing so that its address space was expanded, say,
from $n/k$ to $n$, and maintained the constraint that exactly $k$ keys are stored at some $n/k$ of
the $n$ addresses. How would this modification alter the lower bound on program complexity?
Mehlhorn has shown that perfect hashing with a load average of $\beta$ has a program complexity
of $\Theta(\beta n + \log\log m)$. However, when the paged hashing method is used with similar table
expansion, and $k$ grows asymptotically large, no such similar reduction occurs in the program
complexity. In particular, we can show the following theorem:

Theorem 11. Let $HF_k(m, n, t)$ be the program complexity of the paged hashing method.
Then for increasing $k$, $HF_k(m, n, t) \ge \frac{n \log_2 k}{2k}(1 + o(1))$.

Theorem 11 demonstrates that the program complexity of the paging method is not
significantly reduced by an expansion of the address space, since the lower bound is
asymptotically equivalent to the lower bound found in Theorem 5, i.e., $HF_k(m, n, t) = \Omega(HF_k(m, n, n/k))$
for all $t \ge n/k$.


§7  Probabilistic  Hashing  and  Hash  Circuits 

We now give an application of the above analysis to showing a lower bound on the
worst-case behavior of universal classes of hash functions (see [Carter and Wegman]).

A class H of hash functions is called universal_2 if for any distinct x, y ∈ M, no more than
|H|/n of the functions h ∈ H satisfy h(x) = h(y). Carter and Wegman essentially showed that
by choosing an h ∈ H randomly, we can answer “Is x ∈ S?” in O(1) probes on the average.
More specifically, they showed that for any x ∈ M and randomly chosen h ∈ H, the expected
number of y ∈ S colliding with x under h is at most 1. In various applications it is advantageous to
minimize the size of the class of hash functions: in [Mehlhorn] it is shown that the smallest
universal class of hash functions has size Θ(n log_n m). We can then show the following theorem:
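
The defining condition can be checked exhaustively on a toy class. The sketch below is an illustration only: it assumes one standard family, h(x) = ((a·x + b) mod p) mod n with p prime, 1 ≤ a < p and 0 ≤ b < p, over the keyspace {0, ..., p−1}. This family is not claimed to be the minimal class discussed in the text, and the names `make_class`, `h`, and `max_colliding_functions` are hypothetical.

```python
from itertools import combinations


def make_class(p):
    """All parameter pairs (a, b) with 1 <= a < p and 0 <= b < p."""
    return [(a, b) for a in range(1, p) for b in range(p)]


def h(ab, x, p, n):
    """Hash key x to one of n addresses using parameters (a, b)."""
    a, b = ab
    return ((a * x + b) % p) % n


def max_colliding_functions(p, n):
    """Largest number of functions in the class that collide on one pair x != y."""
    H = make_class(p)
    worst = 0
    for x, y in combinations(range(p), 2):
        count = sum(1 for ab in H if h(ab, x, p, n) == h(ab, y, p, n))
        worst = max(worst, count)
    return worst, len(H)


p, n = 17, 4
worst, size = max_colliding_functions(p, n)

# Universality requires worst <= |H| / n for every pair of distinct keys.
print(worst, size, size / n)
```

If the check passes, a randomly chosen function collides on any fixed pair with probability at most 1/n, which is what yields the bound on the expected number of collisions with a set S of n keys.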

Theorem 12. Let H be a class of universal hash functions of minimal size. Then there exists
a set S ∈ M^(n) depending on H such that the following is true: for any h ∈ H there exists a
set S' ⊆ S where x, y ∈ S' implies h(x) = h(y), and |S'| = Ω(log n / log log n).

This theorem implies that if we use a “minimal” class of universal hash functions,
where we choose any function from the class and construct an accompanying table structure,
there will always be queries that take Ω(log n / log log n) probes to answer, far worse than the O(1)
expected time derived from choosing randomly. Probabilistic algorithms are designed to give
a good expected time behavior independent of the input by randomization in the algorithm.
Theorem 12 shows how in some sense the randomization can be “beat,” since the proof gives a
lower bound on choosing the best function as opposed to a random one, and choosing randomly
cannot do any better than choosing the best function. It also suggests that a “minimal”
program cannot do much better than binary search, which was shown by [Yao] for a very large
keyspace.
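
The quantity bounded in Theorem 12 can be computed by brute force on toy instances. The sketch below is not the proof technique and does not use a minimal class: it assumes the family h(x) = ((a·x + b) mod p) mod n purely for illustration, the names `largest_bucket` and `hardest_set` are hypothetical, and the sizes are far too small to exhibit the asymptotic bound. It searches for a set S on which even the best function in the class leaves a large group of colliding keys.

```python
from collections import Counter
from itertools import combinations


def largest_bucket(addresses):
    """Size of the largest group of keys sharing one address."""
    return max(Counter(addresses).values())


def hardest_set(tables, universe, n):
    """Find an n-subset S whose best function still has the largest bucket.

    tables[i][x] is the address that function i assigns to key x.
    """
    best_S, best_score = None, 0
    for S in combinations(universe, n):
        # The best function for S minimizes the size of its largest bucket.
        score = min(largest_bucket([t[x] for x in S]) for t in tables)
        if score > best_score:
            best_S, best_score = S, score
    return best_S, best_score


p, n = 11, 3
universe = range(p)

# A class containing a single function: an adversary can force n collisions.
single = [[x % n for x in universe]]

# A larger class: the adversary must defeat every function simultaneously.
family = [[((a * x + b) % p) % n for x in universe]
          for a in range(1, p) for b in range(p)]

S_single, score_single = hardest_set(single, universe, n)
S_family, score_family = hardest_set(family, universe, n)
print(S_single, score_single)
print(S_family, score_family)
```

Enlarging the class can only lower the score for a given set, which is the tension the theorem quantifies: a class of minimal size is too small to drive the worst case down to a constant.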

The lower bound of Theorem 12 is on a very worst case measure of the performance of
universal classes of hash functions, but does show that their expected behavior is not completely
input independent. These functions still can perform well in an amortized sense (say over a
sequence of queries over all x in S), but this is just a hidden form of averaging over an input
sequence.

A  similar  bound  can  be  proven  about  certain  classes  of  hash  circuits.  Define  a 
computation-bounded  hash  circuit  as  a  directed  acyclic  graph  where  the  vertices  of  the  graph 
represent  Boolean  operations  with  fan-in  bounded  by  a  constant,  and  fan-out  of  1.  The 
computation  is  additionally  bounded  by  requiring  that  the  circuit  can  use  an  input  bit  (i.e.,  a 
bit  from  the  key  x)  no  more  than  a  constant  number  of  times.  We  envision  the  latter  condition 
by  providing  a  constant  number  of  copies  of  each  input  bit  to  the  circuit,  which  may  be  “used” 
or “ignored” by the circuit. The outputs of some fixed log_2 n of the vertices are chosen as the
bits  of  the  hash  address. 
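
The definition can be illustrated with a small data structure. This is a sketch under stated assumptions: gates are written as nested tuples, so each gate feeds exactly one parent (fan-out 1), leaves of the form ("in", i) stand for one copy of input bit i, and the gate set and the names `OPS`, `input_uses`, `max_fan_in`, `evaluate`, and `hash_address` are hypothetical choices not taken from the paper.

```python
from collections import Counter

# Boolean operations available at the vertices (an illustrative gate set).
OPS = {
    "and": lambda bits: int(all(bits)),
    "or": lambda bits: int(any(bits)),
    "xor": lambda bits: sum(bits) % 2,
    "not": lambda bits: 1 - bits[0],
}


def input_uses(node, counts=None):
    """Count how many copies of each input bit appear below this node."""
    counts = Counter() if counts is None else counts
    if node[0] == "in":
        counts[node[1]] += 1
    else:
        for child in node[1:]:
            input_uses(child, counts)
    return counts


def max_fan_in(node):
    """Largest number of children of any gate below this node."""
    if node[0] == "in":
        return 0
    return max([len(node) - 1] + [max_fan_in(c) for c in node[1:]])


def evaluate(node, key_bits):
    """Evaluate one output vertex on the bits of the key."""
    if node[0] == "in":
        return key_bits[node[1]]
    return OPS[node[0]]([evaluate(c, key_bits) for c in node[1:]])


def hash_address(outputs, key_bits):
    """Concatenate the output bits, most significant first, into an address."""
    addr = 0
    for out in outputs:
        addr = (addr << 1) | evaluate(out, key_bits)
    return addr


# A 4-bit key hashed to log_2 n = 2 address bits (n = 4).
key_bits = [1, 0, 1, 1]
outputs = [
    ("xor", ("in", 0), ("in", 1)),
    ("and", ("in", 2), ("or", ("in", 3), ("in", 0))),
]

uses = Counter()
for out in outputs:
    input_uses(out, uses)

print(hash_address(outputs, key_bits), dict(uses))
```

Here input bit 0 is used twice and every gate has fan-in 2, so the circuit satisfies the constraints with both constants equal to 2.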

Theorem 13. Let H be a class of computation-bounded hash circuits, and assume that
log_2 m log_2 log_2 m = o(n). Then there exists a set S ∈ M^(n) depending on H such that the
following is true: for any h ∈ H there exists a set S' ⊆ S where x, y ∈ S' implies h(x) = h(y),
and |S'| = Ω(log n / log log n).


§8  Open  Problems 

We conjecture that an easy-to-compute hash function with small program complexity
can  be  constructed  for  any  fixed  set  S  €  which  has  no  more  than  Cf  -  jn)  collisions  to 
any  address. 

Another topic that merits further investigation is adaptive tradeoffs. The tradeoff
schemes we have considered have mostly been nonadaptive. For example, in hashing with
separate chains, the information stored in the search program P is used to choose the right
chain, but subsequent probes do not adapt to information gained by earlier, failed probes
(except that they failed). The lower bound in [Yao] analyzes some adaptive search models.
These models do not allow the search program to have knowledge in advance about the table it
searches, and so the analysis is incompatible with the approach taken here. What can be done
to unify these two approaches? One promising line of attack might be to extend the approach
of Gonnet and Larson: consider an algorithm used for insertion, and show some entropic bound
on its behavior which indicates how much information must be passed to the search algorithm.

The analysis of adaptive search leads naturally to more dynamic considerations. What
can be said about the program complexity of search strategies that modify themselves as keys
are inserted into and deleted from the table? This question is probably very difficult.

Hash circuits are an additional area where open problems remain. How strong is the
lower bound of Theorem 5.1? If our above conjecture about easy-to-compute number-theoretic
functions is true, the lower bound should be quite strong. However, VLSI models suggest chip
area and time, rather than gate complexity, as measures of the performance of circuits. Can
an AT^2 lower bound be shown for, say, a circuit which is a perfect hash function? What kind
of information transfer must occur in a hash circuit?